[Report] Accelerate FM pre-training on Amazon SageMaker HyperPod #AWSreInvent

AWS re:Invent 2024
#SageMaker HyperPod
#AWS
Takakuni
2024.12.03
Hello! I'm Takakuni (@takakuni_) from the Consulting Department, AWS Business Division.
I'm in Las Vegas for re:Invent 2024.
https://reinvent.awsevents.com/
There was an interesting-looking hands-on session called "Accelerate FM pre-training on Amazon SageMaker HyperPod", so I joined it.
Session overview
Title
AIM403 | Accelerate FM pre-training on Amazon SageMaker HyperPod
Description
Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure for training foundation models (FMs), reducing training time by up to 40%. In this builders’ session, learn how to pre-train a large language model (LLM) using Slurm on SageMaker HyperPod. Explore the model pre-training workflow from start to finish, including setting up clusters, troubleshooting convergence issues, and running distributed training to improve model performance. You must bring your laptop to participate.
Speakers
Sean Smith, Solution Architect, AWS
Keita Watanabe, Sr. Solutions Architect, AWS
Pooja Karadgi, Senior Technical Product Manager, Amazon Web Services
Matthew Nightingale, Sr Specialist Solutions Architect, AWS
Aman Shanbhag, Specialist Solutions Architect, Amazon Web Services
Content
The agenda was as follows. The content is a shortened version of the Amazon SageMaker HyperPod workshop.
Prerequisites
Cluster Setup
a. Easy cluster setup
b. AWS Console
c. SSH into Cluster
d. Get to know your Cluster

PyTorch DDP on CPU
Cleanup
A longer version is also available. You can run it in your own account, so give it a try if you are interested.
https://catalog.workshops.aws/sagemaker-hyperpod/en-US
Cluster Setup
This time I used SageMaker HyperPod with Slurm as the orchestrator.
Since EKS is not used, we create the SageMaker HyperPod configuration files.
I used automate-cluster-creation.sh for the configuration.
https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/automate-cluster-creation.sh
The script builds the configuration files through interactive prompts, so it was very easy.
# Create a working directory and download the setup script
mkdir hyperpod && cd hyperpod

curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/refs/heads/aim_403_riv_2024/1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/automate-cluster-creation.sh

# Run the script
bash automate-cluster-creation.sh

# Check the cluster status
aws sagemaker list-clusters --output table
After waiting about 10 minutes, the SageMaker HyperPod cluster was ready.
sagemaker-user@default:~/hyperpod$ aws sagemaker list-clusters --output table
-----------------------------------------------------------------------------------------------------------------------------------------
|                                                             ListClusters                                                              |
+---------------------------------------------------------------------------------------------------------------------------------------+
||                                                          ClusterSummaries                                                           ||
|+----------------------------------------------------------------+--------------+----------------+------------------------------------+|
||                           ClusterArn                           | ClusterName  | ClusterStatus  |           CreationTime             ||
|+----------------------------------------------------------------+--------------+----------------+------------------------------------+|
||  arn:aws:sagemaker:us-west-2:960663929760:cluster/i1yddqtc1sbg |  ml-cluster  |  InService     |  2024-12-02T22:48:33.846000+00:00  ||
|+----------------------------------------------------------------+--------------+----------------+------------------------------------+|
sagemaker-user@default:~/hyperpod$
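Rather than rerunning list-clusters by hand during the roughly 10 minute wait, the status check can be put in a loop. The following is a minimal sketch, not something the session provided: the function names, the polling interval, and the retry count are my own assumptions. It relies on `aws sagemaker describe-cluster` returning a `ClusterStatus` field.

```shell
# Fetch the current status of a HyperPod cluster by name.
get_cluster_status() {
  aws sagemaker describe-cluster --cluster-name "$1" \
    --query 'ClusterStatus' --output text
}

# wait_for_cluster NAME [INTERVAL_SECONDS] [MAX_TRIES]  (hypothetical helper)
# Polls until the cluster reports InService.
# Returns non-zero if the cluster fails, rolls back, or the tries run out.
wait_for_cluster() {
  name=$1
  interval=${2:-30}
  max_tries=${3:-40}
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    status=$(get_cluster_status "$name") || return 1
    echo "[$i] $name: $status"
    case "$status" in
      InService) return 0 ;;
      Failed|RollingBack) return 1 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Usage (values are illustrative): wait_for_cluster ml-cluster 30 40
```

Keeping the AWS call in its own small function makes the loop easy to try out with a stub before pointing it at a real cluster.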
You can also check the cluster from the console. One head node and four worker nodes are running.
The lifecycle script on_create.sh performs the initial setup of the nodes.
https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
Log in over SSH
Log in to the head node over SSH.
After installing session-manager-plugin, running the HyperPod Cluster Easy SSH Script was all it took to log in. Quite a few shell scripts are provided.
sagemaker-user@default:~/hyperpod$ ls
automate-cluster-creation.sh  awsome-distributed-training  cluster-config.json  env_vars  provisioning_parameters.json  validate-config.py
sagemaker-user@default:~/hyperpod$ awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh  -c controller-machine ml-cluster

=================================================

==== 🚀 HyperPod Cluster Easy SSH Script! 🚀 ====

=================================================
Cluster id: i1yddqtc1sbg
Instance id: i-065f9096563898630
Node Group: controller-machine
grep: /home/sagemaker-user/.ssh/config: No such file or directory
Would you like to add ml-cluster to  ~/.ssh/config (yes/no)?
>
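The script prints a cluster id, an instance id, and a node group, which suggests it connects through Session Manager. As far as I understand the HyperPod documentation, the SSM target has the form `sagemaker-cluster:<cluster-id>_<node-group>-<instance-id>`. The sketch below assembles that string from the values shown above. The helper name is hypothetical, and the target format is my assumption rather than something the output here confirms.

```shell
# build_ssm_target CLUSTER_ID NODE_GROUP INSTANCE_ID  (hypothetical helper)
# Assumed target format: sagemaker-cluster:<cluster-id>_<node-group>-<instance-id>
build_ssm_target() {
  printf 'sagemaker-cluster:%s_%s-%s\n' "$1" "$2" "$3"
}

# Values copied from the Easy SSH Script output above.
target=$(build_ssm_target i1yddqtc1sbg controller-machine i-065f9096563898630)
echo "$target"

# To open a session (needs AWS credentials and session-manager-plugin):
#   aws ssm start-session --target "$target"
```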
Running Slurm commands
After logging in to the head node, run some Slurm commands. There are four worker nodes, but the queue has no jobs yet.
ubuntu@ip-10-1-93-50:~$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
dev*             up   infinite      4   idle ip-10-1-20-177,ip-10-1-81-[107,236],ip-10-1-108-171
ml.c5.4xlarge    up   infinite      4   idle ip-10-1-20-177,ip-10-1-81-[107,236],ip-10-1-108-171
ubuntu@ip-10-1-93-50:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
ubuntu@ip-10-1-93-50:~$
Worker nodes
Let's log in to one of the worker nodes.
# Allocate one worker node
salloc -N 1
# SSH into the node that was allocated
ssh $(srun hostname)
I was able to log in. The logo looks cool.
The login banner also shows System information and Security Maintenance details.
ubuntu@ip-10-1-93-50:~$ ssh $(srun hostname)
Warning: Permanently added 'ip-10-1-20-177' (ECDSA) to the list of known hosts.
========================================================================================
    ____              __  ___     __             __ __                  ___          __
   / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ \___  ___/ /
  _\ \/ _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ \/ -_) __/ ___/ _ \/ _  /
 /___/\_,_/\_, /\__/_/  /_/\_,_/_/\_\\__/_/   /_//_/\_, / .__/\__/_/ /_/   \___/\_,_/
          /___/                                    /___/_/
                          HyperPod Instance AMI (Ubuntu 20.04)
========================================================================================

Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1072-aws x86_64v)

Utility libraries are installed in /usr/bin/python3.9.
To access them, use /usr/bin/python3.9.

AWS Deep Learning AMI Homepage: https://aws.amazon.com/machine-learning/amis/
Release Notes: https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html
Support: https://forums.aws.amazon.com/forum.jspa?forumID=263
For a fully managed experience, check out Amazon SageMaker at https://aws.amazon.com/sagemaker
=============================================================================

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

 System information as of Mon Dec  2 23:11:29 UTC 2024

  System load:  0.06               Processes:             297
  Usage of /:   52.6% of 96.73GB   Users logged in:       0
  Memory usage: 4%                 IPv4 address for ens6: 10.1.20.177
  Swap usage:   0%

 * Ubuntu Pro delivers the most comprehensive open source security and
   compliance features.

   https://ubuntu.com/aws/pro

Expanded Security Maintenance for Applications is not enabled.

23 updates can be applied immediately.
19 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable

41 additional security updates can be applied with ESM Apps.
Learn more about enabling ESM Apps service at https://ubuntu.com/esm

To replace an instance run:
   sudo scontrol update node=<hostname> state=fail reason="Action:Replace"

To automatically resume jobs, please add the following in your job submission script:
   srun --auto-resume=1

Instance Type: c5.4xlarge

=============================================================================
AMI Name: Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04)
Supported EC2 instances: G4dn, G5, G6, Gr6, G6e, P4d, P4de, P5, P5e
NVIDIA driver version: 550.127.05
CUDA versions available: cuda-12.1 cuda-12.2 cuda-12.3 cuda-12.4
Default CUDA version is 12.1

Release notes: https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html
AWS Deep Learning AMI Homepage: https://aws.amazon.com/machine-learning/amis/
Developer Guide and Release Notes: https://docs.aws.amazon.com/dlami/latest/devguide/what-is-dlami.html
Support: https://forums.aws.amazon.com/forum.jspa?forumID=263
For a fully managed experience, check out Amazon SageMaker at https://aws.amazon.com/sagemaker
=============================================================================

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

   ____              __  ___     __             __ __                  ___          __
  / __/__ ____ ____ /  |/  /__ _/ /_____ ____  / // /_ _____  ___ ____/ _ \___  ___/ /
 _\ \/ _ `/ _ `/ -_) /|_/ / _ `/  '_/ -_) __/ / _  / // / _ \/ -_) __/ ___/ _ \/ _  /
/___/\_,_/\_, /\__/_/  /_/\_,_/_/\_\\__/_/   /_//_/\_, / .__/\__/_/ /_/   \___/\_,_/
         /___/                                    /___/_/

To replace an instance run:
   sudo scontrol update node=<hostname> state=fail reason="Action:Replace"

To automatically resume jobs, please add the following in your job submission script:
   srun --auto-resume=1

You're on the compute
Controller Node IP: 10.1.93.50
Instance Type: ml.c5.4xlarge
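PyTorch DDP on CPU
Next, run distributed data parallel (DDP) training with PyTorch on CPU (ml.m5.2xlarge) instances. Note that the sinfo output and the login banner above report ml.c5.4xlarge for the nodes in this cluster.
There were Docker-based and Conda-based options. This time I chose the Docker-based one.
Use Enroot to convert the Docker image into an image file, then submit the job with sbatch.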
bash 2.create-enroot-image.sh
sbatch 3.container-train.sbatch
Let's check the logs.
tail -f logs/cpu-ddp-container_*.out
Here is an excerpt. The job produced a huge volume of log output.
0: [RANK 1] Epoch 6757 | Batchsize: 32 | Steps: 8
0: [RANK 2] Epoch 6757 | Batchsize: 32 | Steps: 8
0: [RANK 3] Epoch 6757 | Batchsize: 32 | Steps: 8
0: [RANK 0] Epoch 6757 | Batchsize: 32 | Steps: 8
1: [RANK 4] Epoch 6757 | Batchsize: 32 | Steps: 8
1: [RANK 5] Epoch 6757 | Batchsize: 32 | Steps: 8
1: [RANK 6] Epoch 6757 | Batchsize: 32 | Steps: 8
1: [RANK 7] Epoch 6757 | Batchsize: 32 | Steps: 8
0: [RANK 1] Epoch 6758 | Batchsize: 32 | Steps: 8
0: [RANK 2] Epoch 6758 | Batchsize: 32 | Steps: 8
0: [RANK 0] Epoch 6758 | Batchsize: 32 | Steps: 8
0: [RANK 3] Epoch 6758 | Batchsize: 32 | Steps: 8
1: [RANK 4] Epoch 6758 | Batchsize: 32 | Steps: 8[RANK 5] Epoch 6758 | Batchsize: 32 | Steps: 8
1:
1: [RANK 6] Epoch 6758 | Batchsize: 32 | Steps: 8
1: [RANK 7] Epoch 6758 | Batchsize: 32 | Steps: 8
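With this much output, a short summary is easier to read than the raw tail. The following is a rough sketch that assumes the log format shown above (`[RANK n] Epoch m | ...`); the function names are my own. It uses `grep -o` so that lines where two ranks were written together, as in the excerpt, are still counted.

```shell
# latest_epoch: reads training log lines on stdin, prints the highest epoch seen.
latest_epoch() {
  grep -o 'Epoch [0-9][0-9]*' | awk '$2 > max { max = $2 } END { print max + 0 }'
}

# rank_count: reads the same log on stdin, prints how many distinct ranks appear.
rank_count() {
  grep -o 'RANK [0-9][0-9]*' | sort -u | wc -l | tr -d ' '
}

# Sample lines taken from the excerpt above, including the interleaved one.
sample_log='0: [RANK 3] Epoch 6757 | Batchsize: 32 | Steps: 8
0: [RANK 0] Epoch 6758 | Batchsize: 32 | Steps: 8
1: [RANK 4] Epoch 6758 | Batchsize: 32 | Steps: 8[RANK 5] Epoch 6758 | Batchsize: 32 | Steps: 8'

printf '%s\n' "$sample_log" | latest_epoch
printf '%s\n' "$sample_log" | rank_count

# On the cluster (path as used above):
#   cat logs/cpu-ddp-container_*.out | latest_epoch
```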
Monitoring
Finally, let's check the CPU usage of a node where the job is running.
First, use sinfo to identify the nodes in the alloc state.
ubuntu@ip-10-1-93-50:~/awsome-distributed-training/3.test_cases/16.pytorch-cpu-ddp/slurm$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
dev*             up   infinite      2  alloc ip-10-1-20-177,ip-10-1-81-107
dev*             up   infinite      2   idle ip-10-1-81-236,ip-10-1-108-171
ml.c5.4xlarge    up   infinite      2  alloc ip-10-1-20-177,ip-10-1-81-107
ml.c5.4xlarge    up   infinite      2   idle ip-10-1-81-236,ip-10-1-108-171
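Picking the allocated nodes out of the table can also be scripted. This is a minimal sketch that assumes the default sinfo column layout shown above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST); the function name is hypothetical.

```shell
# nodelist_for_state PARTITION STATE  (hypothetical helper)
# Reads default `sinfo` output on stdin and prints the NODELIST column
# for rows matching the given partition and state.
nodelist_for_state() {
  awk -v part="$1" -v state="$2" '
    NR > 1 {
      name = $1
      sub(/\*$/, "", name)   # the default partition is shown as "dev*"
      if (name == part && $5 == state) print $6
    }'
}

# Sample taken from the sinfo output above.
sample_sinfo='PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
dev*             up   infinite      2  alloc ip-10-1-20-177,ip-10-1-81-107
dev*             up   infinite      2   idle ip-10-1-81-236,ip-10-1-108-171
ml.c5.4xlarge    up   infinite      2  alloc ip-10-1-20-177,ip-10-1-81-107
ml.c5.4xlarge    up   infinite      2   idle ip-10-1-81-236,ip-10-1-108-171'

printf '%s\n' "$sample_sinfo" | nodelist_for_state dev alloc

# On the cluster: sinfo | nodelist_for_state dev alloc
```

The NODELIST column may use Slurm's bracket notation (for example `ip-10-1-81-[107,236]` in the earlier output), so splitting it on commas is not safe. `scontrol show hostnames <nodelist>` expands it into one hostname per line.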
Log in over SSH to one of the allocated worker nodes and check its metrics with htop.
ssh ip-10-1-20-177
sudo apt-get install -y htop && htop
CPU usage was hovering around 45%.
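htop is interactive, so for a number that can be logged or compared across nodes, one alternative is to sample /proc/stat twice and compute the busy percentage. This is a sketch of that idea, not something covered in the session; the function name is hypothetical. It sums the user through steal fields of the aggregate `cpu` line and treats idle plus iowait as idle time.

```shell
# cpu_busy_pct PREV CUR  (hypothetical helper)
# PREV and CUR are the first line ("cpu ...") of /proc/stat at two points in time.
# Prints the share of non-idle CPU time between the two samples, in percent.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user .. steal
      idle = $5 + $6                         # idle + iowait
    }
    NR == 1 { t0 = total; i0 = idle }
    NR == 2 {
      dt = total - t0
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", (1 - (idle - i0) / dt) * 100
    }'
}

# Usage on a node:
#   prev=$(head -n 1 /proc/stat); sleep 1; cur=$(head -n 1 /proc/stat)
#   cpu_busy_pct "$prev" "$cur"
```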
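Summary
That was "[Report] Accelerate FM pre-training on Amazon SageMaker HyperPod".
In the morning I attended a session covering similar content on EKS, and the differences in the orchestrator and in the monitoring approach made this one very instructive.
As mentioned earlier, a longer version is also available. You can run it in your own account, so give it a try if you are interested.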
https://catalog.workshops.aws/sagemaker-hyperpod/en-US
I hope this post is helpful to someone. This was Takakuni (@takakuni_) from the Consulting Department, AWS Business Division!